Overview

  • Set up
  • Genomic Heatmaps
  • Making a simple heatmap
  • More complex heatmaps

Set up


Materials

All prerequisites, links to material and slides for this course can be found on GitHub.

Alternatively, they can be downloaded as a zip archive from here.

Presentations, source code and practicals

Once the zip file is unarchived, all presentations (as HTML slides and single pages), their R code and the HTML practical sheets will be available in the directories underneath.

  • presentations/slides/ Presentations as an HTML slide show.
  • presentations/singlepage/ Presentations as an HTML single page.
  • presentations/r_code/ R code in presentations.

Set the Working directory

Before running any of the code in the practicals or slides, we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived GenomicHeatmapsAndProfiles folder in the RStudio menu.

Session -> Set Working Directory -> Choose Directory

Alternatively, set it in the console with the setwd() function.
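A minimal sketch of setting the working directory from the console. The path below is a placeholder (an assumption, not a real location); replace it with wherever you unarchived the folder.

```r
# Placeholder path: replace with the location of your unarchived folder.
course_dir <- "path/to/GenomicHeatmapsAndProfiles"

if (dir.exists(course_dir)) {
  setwd(course_dir)
} else {
  message("Directory not found, working directory unchanged: ", course_dir)
}

# Confirm where R is currently working.
getwd()
```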

Raw data to visualisations


Genomics data visualisations

A common task in genomics is to review your data in your favourite genome browser. Here we are using IGV, and our material is available here.

Genomics data visualisations

To visualise our data we first need to get it into a suitable format for visualisation. For more information on file formats, see our material available here.

High-throughput sequencing data is typically delivered as unaligned sequences in FastQ format.

FastQ (FASTA with Qualities)

  • “@” followed by identifier.
  • Sequence information.
  • “+”
  • Quality scores encoded as ASCII.
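To make the four-line layout concrete, here is a small base R sketch that decodes one record. The record is toy data, and the Phred+33 offset is an assumption (it is the common modern encoding; older Illumina data may use an offset of 64).

```r
# One FastQ record (toy data): identifier, sequence, separator, qualities.
record <- c("@read1", "ACGTN", "+", "II5!#")

# Strip the leading "@" to get the read identifier.
read_id <- sub("^@", "", record[1])

# Split the sequence into single bases.
bases <- strsplit(record[2], "")[[1]]

# Assuming Phred+33 encoding: quality = ASCII code of the character - 33.
qualities <- utf8ToInt(record[4]) - 33L

# One row per base with its decoded quality score.
data.frame(base = bases, quality = qualities)
```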

Genomics data visualisations

We need to work with aligned data for visualisation and analysis in our genome of choice. Aligned sequence data is often stored in SAM format.

SAM - Sequence Alignment Map

  • Contains read and alignment information including genomic location

Genomics data visualisations

Typically, for visualisation in genome browsers we require only a subset of this information: the depth of reads (signal) at genomic locations. We can then take advantage of simpler formats such as bedGraph (an extension of BED).

bed(Browser Extensible Data)Graph

  • BED 3 format: Chromosome, Start, End
  • 4th column: Score

Genomics data visualisations

These formats are easy for us to read but inefficient for software to work with. When using software we will most often work with highly compressed and indexed equivalent files.

  • SAM – BAM (with a .bai index)
  • wig/bedGraph – bigWig
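As an illustration of the bedGraph to bigWig relationship, the sketch below writes a few scored ranges to a bigWig file with rtracklayer and reads them back. The ranges, scores and chromosome length are toy values chosen for the example. Note that rtracklayer's bigWig export is not supported on Windows.

```r
library(GenomicRanges)
library(rtracklayer)

# Toy bedGraph-like signal: non-overlapping ranges, each with a score.
signal <- GRanges("chr1",
                  IRanges(start = c(1, 101, 201), end = c(100, 200, 300)),
                  score = c(0, 5.5, 2))

# bigWig export needs chromosome lengths; 1000 is an arbitrary toy value.
seqlengths(signal) <- c(chr1 = 1000)

# Write the compressed, indexed bigWig and read it back in.
bw_file <- tempfile(fileext = ".bw")
export.bw(signal, bw_file)
roundtrip <- import.bw(bw_file)
roundtrip
```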

Raw data to visualisations in R (Quick)


Full workflows available.

In the next few slides we are going to run through an alignment of ChIP-seq FastQ to sorted and indexed BAM files and then create a bigWig.

For a full summary of this processing please see our relevant courses.

A reference genome.

To align our raw FastQ data we will need a reference genome in FASTA format to align against.

We can retrieve this from the Ensembl FTP download page for many organisms.

FastQ to BAM and bigWigs.

Once we have the FASTA file available, we can build an index of the FASTA for alignment using the Rsubread package. The indexSplit parameter here will save us memory in alignment but may cost us some speed.
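A hedged sketch of the indexing step with Rsubread's buildindex() function. The FASTA file name, index name and memory value are placeholders, so the call only runs if the reference file is present.

```r
library(Rsubread)

# Hypothetical names: replace with your own reference FASTA and index prefix.
genome_fasta <- "genome.fa"
index_name <- "genome_index"

if (file.exists(genome_fasta)) {
  # indexSplit = TRUE splits the index into blocks to lower memory use;
  # memory sets the approximate memory limit in megabytes (illustrative value).
  buildindex(basename = index_name,
             reference = genome_fasta,
             indexSplit = TRUE,
             memory = 1000)
} else {
  message("Reference FASTA not found: ", genome_fasta)
}
```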

FastQ to BAM and bigWigs.

Now we have an indexed BAM file we can easily extract the total mapped reads.

## [1] 10302212
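The steps from FastQ to a sorted, indexed BAM, the mapped read count and a bigWig can be sketched as follows. All file names are placeholders and the alignment settings are assumptions rather than the settings used for the count above, so the code only runs if the FastQ file exists.

```r
library(Rsubread)
library(Rsamtools)
library(GenomicAlignments)
library(rtracklayer)

# Hypothetical inputs: replace with your own FastQ file and Rsubread index.
fastq_file <- "sample.fastq.gz"
index_name <- "genome_index"

if (file.exists(fastq_file)) {
  # Align reads to the indexed genome, writing an unsorted BAM file.
  align(index = index_name, readfile1 = fastq_file,
        output_file = "sample.bam", type = "dna", nthreads = 2)

  # Sort (writes sorted_sample.bam) and index the BAM file.
  sortBam("sample.bam", "sorted_sample")
  indexBam("sorted_sample.bam")

  # Total mapped reads: sum of per-chromosome mapped counts from the index.
  total_mapped <- sum(idxstatsBam("sorted_sample.bam")$mapped)
  print(total_mapped)

  # Read depth across the genome, exported as a bigWig.
  export.bw(coverage("sorted_sample.bam"), "sorted_sample.bw")
} else {
  message("FastQ file not found: ", fastq_file)
}
```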

Visualising signal over multiple genomic regions


Review in IGV

We can easily review our signal data in IGV at a single locus to assess the relationship between samples/conditions/antibodies.

Multipanel review in IGV

In IGV we can also review multiple loci in one screen to get an overview of signal over a group of regions.

Visualisation in heatmaps

Often we want to review signal within and between samples/conditions over 1000s of sites. To do this we can take advantage of genomic heatmaps and profiles.

Two major software packages produce genomic heatmaps:

  • Deeptools
  • EnrichedHeatmap

ProfilePlyr package

There are many requests for more complex heatmaps and for summary statistics of the data visualised within them.

Profileplyr is available from Bioconductor and was developed with Doug Barrows to:

  • Allow import/export between Deeptools and the R framework.
  • Provide easy tools to manipulate and summarise heatmap data.

Getting hold of DeepTools

You will not require Deeptools to follow the course, but if you would like to install it, this can be done easily through Conda or Docker.

For more information on Deeptools you can see our course here.

Getting hold of data.

First we need to get hold of some data to plot.

For demonstration we will take a small processed data set of bigWigs from a paper containing ChIP-seq for the transcription factors ZBTB1 and ATF4.

We can retrieve raw and processed data for GSM4332935, GSM4332934, GSM4332940, GSM4332950 and GSM4332944 directly from GEO.

Getting hold of data.

We can also use some of the tools in Bioconductor to retrieve supplementary files from GEO directly.

To do this we will use the GEOquery package. We can install this following the commands on the Bioconductor page for GEOquery.
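A minimal sketch using GEOquery's getGEOSuppFiles() function for one of the accessions above. It needs an internet connection, so the call is wrapped in tryCatch(). The fetch_files = FALSE argument (assumed to be available in your GEOquery version) only lists the supplementary files; drop it to download them.

```r
library(GEOquery)

# One of the sample accessions listed earlier.
accession <- "GSM4332935"

# List the supplementary files without downloading them.
supp_files <- tryCatch(
  getGEOSuppFiles(accession, fetch_files = FALSE),
  error = function(e) {
    message("Could not query GEO: ", conditionMessage(e))
    NULL
  }
)
supp_files
```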

Regions of interest

We want to plot our signal over selected sets of regions. We will import our regions from a BED file using the import.bed function in the rtracklayer package.

## GRanges object with 593 ranges and 2 metadata columns:
##         seqnames              ranges strand |                      name
##            <Rle>           <IRanges>  <Rle> |               <character>
##     [1]     chr2   70660347-70663275      * |    chr2:70660347-70663275
##     [2]    chr12     6036980-6042542      * |     chr12:6036980-6042542
##     [3]    chr13 113526647-113528481      * | chr13:113526647-113528481
##     [4]     chr5       881483-884047      * |        chr5:881483-884047
##     [5]    chr17   79397761-79400379      * |   chr17:79397761-79400379
##     ...      ...                 ...    ... .                       ...
##   [589]    chr11     9024476-9026040      * |     chr11:9024476-9026040
##   [590]     chr1   58387303-58388627      * |    chr1:58387303-58388627
##   [591]    chr17   32834453-32835985      * |   chr17:32834453-32835985
##   [592]     chrY   13683928-13693098      * |    chrY:13683928-13693098
##   [593]    chr10   90901936-90905010      * |   chr10:90901936-90905010
##             score
##         <numeric>
##     [1]         0
##     [2]         0
##     [3]         0
##     [4]         0
##     [5]         0
##     ...       ...
##   [589]         0
##   [590]         0
##   [591]         0
##   [592]         0
##   [593]         0
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
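A self-contained sketch of the import step. Rather than assuming the course file's location, it writes two of the regions shown above to a temporary BED file and imports that. BED starts are 0-based, so import.bed() shifts them by one to give 1-based GRanges coordinates.

```r
library(rtracklayer)

# Write a small BED file (0-based starts) with two regions from the output above.
bed_file <- tempfile(fileext = ".bed")
writeLines(c("chr2\t70660346\t70663275\tchr2:70660347-70663275\t0",
             "chr12\t6036979\t6042542\tchr12:6036980-6042542\t0"),
           bed_file)

# Import as a GRanges object; starts become 1-based.
peaks <- import.bed(bed_file)
peaks
```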

GRanges

The BED file regions are stored in R as a GRanges object.

GRanges objects provide much of the functionality seen in BedTools.

Overlaps in GRanges

One of the most useful functions for GRanges objects is the ability to intersect/overlap differing regions.

To demonstrate, first we will read in a BED file of promoter positions for hg19.

## GRanges object with 23056 ranges and 2 metadata columns:
##           seqnames              ranges strand |        name     score
##              <Rle>           <IRanges>  <Rle> | <character> <numeric>
##       [1]    chr19   58874015-58876214      - |           1         0
##       [2]     chr8   18246755-18248954      + |          10         0
##       [3]    chr20   43280177-43282376      - |         100         0
##       [4]    chr18   25757246-25759445      - |        1000         0
##       [5]     chr1 244006687-244008886      - |       10000         0
##       ...      ...                 ...    ... .         ...       ...
##   [23052]     chr9 115095745-115097944      - |        9991         0
##   [23053]    chr21   35734323-35736522      + |        9992         0
##   [23054]    chr22   19109768-19111967      - |        9993         0
##   [23055]     chr6   90537619-90539818      + |        9994         0
##   [23056]    chr22   50964706-50966905      - |        9997         0
##   -------
##   seqinfo: 39 sequences from an unspecified genome; no seqlengths

Overlaps in GRanges

The function %over% allows us to evaluate which regions overlap/intersect between two region sets.

  • A[A %over% B] – Returns all regions in A which overlap regions in B
  • B[B %over% A] – Returns all regions in B which overlap regions in A
  • A[!A %over% B] – Returns all regions in A which do not overlap regions in B
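The three patterns above can be tried on toy ranges. The coordinates are invented for illustration.

```r
library(GenomicRanges)

# Toy region sets on a single chromosome.
A <- GRanges("chr1", IRanges(start = c(1, 100, 500), end = c(50, 150, 550)))
B <- GRanges("chr1", IRanges(start = c(40, 520), end = c(60, 600)))

# Regions in A which overlap regions in B (the first and third ranges of A).
A[A %over% B]

# Regions in B which overlap regions in A (both ranges of B).
B[B %over% A]

# Regions in A which do not overlap regions in B (the second range of A).
A[!A %over% B]
```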

Overlaps in GRanges

So in practice we can easily extract our ZBTB1 peaks which do or do not overlap with promoters. We can then use the export.bed function to write the overlapping GRanges to a BED file.
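A sketch of this split and export, using toy peaks and a toy promoter region in place of the course files. The object names and the temporary output path are illustrative.

```r
library(GenomicRanges)
library(rtracklayer)

# Toy stand-ins for the ZBTB1 peaks and the hg19 promoter regions.
peaks <- GRanges("chr1",
                 IRanges(start = c(100, 5000), end = c(200, 5100)),
                 name = c("peak1", "peak2"))
promoter_regions <- GRanges("chr1", IRanges(start = 150, end = 2350), strand = "+")

# Split peaks by whether they overlap a promoter.
at_promoters <- peaks[peaks %over% promoter_regions]
not_promoters <- peaks[!peaks %over% promoter_regions]

# Write the promoter-overlapping peaks to a BED file.
out_file <- tempfile(fileext = ".bed")
export.bed(at_promoters, out_file)
readLines(out_file)
```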

Heatmaps in Deeptools

The DeepTools software provides a set of tools to convert between file types (i.e. BAM to bigWig) and to plot signal from BAM or bigWig over a set of regions as a heatmap.

Deeptools course for more information

Heatmaps in Deeptools

Two main workhorse functions allow you to create these heatmaps in Deeptools:

  • computeMatrix - Create an intermediate file of signal for heatmap.
  • plotHeatmap - Generate the heatmaps from intermediate file.

Creating GenomicHeatmaps

The computeMatrix function takes a set of arguments to generate an intermediate file:

  • reference-point --referencePoint = Specify position in regions in BED file to centre heatmap.
  • -bs = Bin size across regions in BED file.
  • -b and -a = bp before and after reference point to plot in heatmap.
  • -S = BigWigs to plot signal from.
  • --regionsFileName = BED file containing regions to plot over.
  • --outFileName = Name of intermediate file generated.
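To show how these arguments fit together, the sketch below assembles a computeMatrix call from R and runs it with system2() only if Deeptools and the input files are available. The file names, bin size and flanking distances are illustrative assumptions, not the values used in the course.

```r
# Hypothetical inputs: replace with your own bigWigs and BED file.
bigwigs <- c("sample1.bw", "sample2.bw")
bed_file <- "regions.bed"

# Arguments for computeMatrix in reference-point mode.
cm_args <- c("reference-point",
             "--referencePoint", "center",
             "-bs", "50",
             "-b", "2000", "-a", "2000",
             "-S", bigwigs,
             "--regionsFileName", bed_file,
             "--outFileName", "matrix.gz")

# Show the command that would be run in a terminal.
cat("computeMatrix", cm_args, "\n")

# Run only when Deeptools is installed and the inputs exist.
if (nzchar(Sys.which("computeMatrix")) && all(file.exists(c(bigwigs, bed_file)))) {
  system2("computeMatrix", args = cm_args)
}
```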

Creating GenomicHeatmaps

Once we have our intermediate file we can use this to create our heatmap using the plotHeatmap command.

  • -m = Name of intermediate file generated by Deeptools computeMatrix.
  • -o = Output file name for heatmap.
  • --colorList = Bottom and top of colour scale for heatmap.

Creating GenomicHeatmaps

With our data, we run the plotHeatmap command on the intermediate file to produce the heatmap.
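A sketch of such a plotHeatmap call, again assembled in R and run with system2() only if Deeptools and the intermediate file are present. The file names and colours are placeholders rather than the course's actual values.

```r
# Hypothetical names: the intermediate matrix and the output image.
matrix_file <- "matrix.gz"
heatmap_file <- "heatmap.png"

# Arguments for plotHeatmap; colours run from bottom to top of the scale.
ph_args <- c("-m", matrix_file,
             "-o", heatmap_file,
             "--colorList", "white,red")

# Show the command that would be run in a terminal.
cat("plotHeatmap", ph_args, "\n")

# Run only when Deeptools is installed and the matrix exists.
if (nzchar(Sys.which("plotHeatmap")) && file.exists(matrix_file)) {
  system2("plotHeatmap", args = ph_args)
}
```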

Creating GenomicHeatmaps

The profileplyr package offers similar functionality to the Deeptools software within R. To compute our intermediate matrix we can use the BamBigwig_to_chipProfile function with the same set of inputs as Deeptools.

## class: profileplyr 
## dim: 593 200 
## metadata(0):
## assays(4): GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw
##   GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw
##   GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw
##   GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw
## rownames(593): giID450 giID451 ... giID562 giID563
## rowData names(5): name score sgGroup giID names
## colnames: NULL
## colData names(0):
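A hedged sketch of a call that could produce an object like the one above. The bigWig names are taken from the printed assays and the BED file name from the course data, but their locations are assumptions, so the code only runs if the files are found. The style, distanceAround and bin_size values are also assumptions, chosen to give 200 bins of 20 bp around each region centre.

```r
library(profileplyr)

# File names as they appear in the output above; paths are assumptions.
bigwigs <- c("GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw",
             "GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw",
             "GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw",
             "GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw")
bed_file <- "ZBTB1Peaks.bed"

if (all(file.exists(c(bigwigs, bed_file)))) {
  # Signal in 20 bp bins, 2000 bp either side of each region's centre.
  chip_profile <- BamBigwig_to_chipProfile(signalFiles = bigwigs,
                                           testRanges = bed_file,
                                           format = "bigwig",
                                           style = "point",
                                           distanceAround = 2000,
                                           bin_size = 20)
  # Convert to a profileplyr object for plotting and manipulation.
  zbtb1_heatmap <- as_profileplyr(chip_profile)
  zbtb1_heatmap
} else {
  message("Input bigWig or BED files not found in the working directory.")
}
```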

Creating GenomicHeatmaps

We can then use the generateEnrichedHeatmap function to create a heatmap as we did in Deeptools.
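As a runnable illustration that does not depend on the course files, the sketch below uses the small example matrix shipped with the profileplyr package (not the ZBTB1 data) and plots it with generateEnrichedHeatmap().

```r
library(profileplyr)

# Example Deeptools matrix bundled with profileplyr (not the course data).
example_mat <- system.file("extdata", "example_deepTools_MAT",
                           package = "profileplyr")
example_heatmap <- import_deepToolsMat(example_mat)

# Draw the heatmap with default settings.
generateEnrichedHeatmap(example_heatmap)
```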

Import from DeepTools

However you generate the intermediate file, you can import/export to Deeptools. This allows you to take advantage of any functionality in Deeptools and R together.
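A sketch of the round trip with profileplyr's import_deepToolsMat() and export_deepToolsMat() functions, using the example matrix bundled with the package. The output path is a temporary placeholder; check the package documentation for the exact name of the file that is written.

```r
library(profileplyr)

# Import a Deeptools computeMatrix file (bundled example, not the course data).
example_mat <- system.file("extdata", "example_deepTools_MAT",
                           package = "profileplyr")
example_heatmap <- import_deepToolsMat(example_mat)

# Export back to a Deeptools-style matrix for use with plotHeatmap.
out_path <- file.path(tempdir(), "example_export.MAT")
export_deepToolsMat(example_heatmap, con = out_path)
```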

Blacklist

ChIP-seq will often show the presence of common artefacts such as ultra-high signal regions, known as Blacklists. Such regions can confound peak calling, fragment length estimation and QC metrics.

Blacklist

We can retrieve Blacklists from the ENCODE portal for human, rat and mouse. Here I have pre-downloaded the hg19 Blacklist as a file and placed it in the Beds directory.

## GRanges object with 411 ranges and 2 metadata columns:
##         seqnames            ranges strand |                    name     score
##            <Rle>         <IRanges>  <Rle> |             <character> <numeric>
##     [1]     chr1     564450-570371      * | High_Mappability_island      1000
##     [2]     chr1     724137-727043      * |        Satellite_repeat      1000
##     [3]     chr1     825007-825115      * |                BSR/Beta      1000
##     [4]     chr1   2583335-2634374      * |  Low_mappability_island      1000
##     [5]     chr1   4363065-4363242      * |                (CATTC)n      1000
##     ...      ...               ...    ... .                     ...       ...
##   [407]     chrY 28555027-28555353      * |                    TAR1      1000
##   [408]     chrY 28784130-28819695      * |        Satellite_repeat      1000
##   [409]     chrY 58819368-58917648      * |                (CATTC)n      1000
##   [410]     chrY 58971914-58997782      * |                (CATTC)n      1000
##   [411]     chrY 59361268-59362785      * |                    TAR1      1000
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths

Creating GenomicHeatmaps

The groupBy function allows us to add additional information to our heatmaps. Here we provide the GRanges of our Blacklist BED file and specify that sites which do not overlap should be included, using the include_nonoverlapping argument.

Creating GenomicHeatmaps

We can take advantage of the same %over% function we saw earlier with GRanges to filter to just the ranges in the heatmap which do not overlap our blacklist regions.
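A sketch of this filtering on the example matrix bundled with profileplyr. There is no blacklist for that example, so the first five ranges of the object stand in for blacklist regions. This is purely an assumption for illustration.

```r
library(profileplyr)
library(GenomicRanges)

# Example Deeptools matrix bundled with profileplyr (not the course data).
example_mat <- system.file("extdata", "example_deepTools_MAT",
                           package = "profileplyr")
example_heatmap <- import_deepToolsMat(example_mat)

# Stand-in "blacklist": the first five ranges of the heatmap itself.
blacklist_like <- granges(rowRanges(example_heatmap)[1:5])

# Keep only heatmap rows whose ranges do not overlap the blacklist.
keep <- !rowRanges(example_heatmap) %over% blacklist_like
filtered_heatmap <- example_heatmap[keep, ]
filtered_heatmap
```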

Creating GenomicHeatmaps

Now we have created our filtered set we could export it back to Deeptools to plot using its plotHeatmap function.

Subsetting

We can subset our heatmap to only review the samples we want using indexing in R.

Our heatmaps have three dimensions: Ranges/Rows, Bins/Columns and Samples. To subset samples, we can index our profileplyr object as in standard R.
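A sketch of subsetting by the third (sample) dimension, using the example matrix bundled with profileplyr. It assumes that example contains at least two samples.

```r
library(profileplyr)

# Example Deeptools matrix bundled with profileplyr (not the course data).
example_mat <- system.file("extdata", "example_deepTools_MAT",
                           package = "profileplyr")
example_heatmap <- import_deepToolsMat(example_mat)

# Index order is [ranges, bins, samples]; keep all rows and bins, first two samples.
two_samples <- example_heatmap[, , 1:2]
two_samples
```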

Joining heatmaps.

We can also combine heatmaps when they are processed over the same BED file. Here we plot the ATF4 bigwig over the ZBTB1 peaks as we did earlier for ZBTB1 bigwigs.

## Loading bigwig files.
## Making ChIPprofile object from signal files.
## Importing rlelist..Done
## Filtering regions which extend outside of genome boundaries.....Done
## Filtered 0 of 593 regions
## Splitting regions by Watson and Crick strand....Done
## ..Done
## Found 593 Watson strand regions
## Found 0 Crick strand regions
## Extending regions.....done
## Calculating coverage across regions
## Calculating per contig. 
## contig: 1
## contig: 2
## contig: 3
## contig: 4
## contig: 5
## contig: 6
## contig: 7
## contig: 8
## contig: 9
## contig: 10
## contig: 11
## contig: 12
## contig: 13
## contig: 14
## contig: 15
## contig: 16
## contig: 17
## contig: 18
## contig: 19
## contig: 20
## contig: 21
## contig: 22
## contig: 23
## contig: 24
## Creating ChIPprofile.

Joining samples

We can then join our ZBTB1 and ATF4 bigwigs using the common R combine function c().

Updating sample names

Sample names in these heatmaps are by default based on the names of the input bigwigs. We can update the sample names using the rownames() and sampleData() functions with our profileplyr object.

Updating sample names

We can directly change the sample names when plotting the heatmap too by using the sample_names argument.

Updating sample information

We can also use the sampleData() function to add any additional information to our samples. Here we add information on Antibody.

We can add any arbitrary column name we need using the sampleData() function.

Updating sample information

We can then use this sample information when we plot our heatmap. Here we tell the generateEnrichedHeatmap() function to colour our heatmaps by the Antibody information we just added.

Clustering ranges by summarised signal

Often we want to group ranges across samples by the signal within them. The profileplyr package allows the user to summarise signal within a region and cluster rows according to user defined metrics.

Here we summarise each range by the sum of its signal, specified with the fun parameter, and create two clusters from the data, specified with the cutree_rows parameter.

## Hierarchical clustering used. It is advised to avoid this option with large matrices as the clustering can take a long time. Kmeans is more suitable for large matrices.
## A column has been added to the range metadata with the column name 'cluster', and the 'rowGroupsInUse' has been set to this column.
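A sketch of such a clustering call with profileplyr's clusterRanges() function, run on the example matrix bundled with the package rather than the ZBTB1/ATF4 data.

```r
library(profileplyr)

# Example Deeptools matrix bundled with profileplyr (not the course data).
example_mat <- system.file("extdata", "example_deepTools_MAT",
                           package = "profileplyr")
example_heatmap <- import_deepToolsMat(example_mat)

# Summarise each range by its summed signal and cut the tree into two clusters.
clustered <- clusterRanges(example_heatmap, fun = rowSums, cutree_rows = 2)

# The cluster assignment is stored in the range metadata.
table(mcols(rowRanges(clustered))$cluster)
```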

Clustering ranges by summarised signal

We can now review the clustering by simply passing the new profileplyr object to the generateEnrichedHeatmap function.

We can see that it has created two clusters which appear to contain high or low ATF4 signal.

Adding Groups with names.

If we have an additional set of regions in which we wish to assess the overall ATF4 signal, we can use the groupBy function again to add this information. Here we additionally specify group names in the GRanges_names argument.

## A column has been added to the range metadata with the column name 'GR_overlap_names' that specifies the GRanges each range overlaps with, but the inherited groups are not included.

Joining samples

If we now wish to review the relationship between the clustering and the GRanges group, we can use the extra_annotation_columns argument to specify additional information to add to the heatmap.

Summarising

Now we have our heatmap organised as we require, we may wish to assess the differences in signal within the clusters over the heatmap.

The summarize function allows us to again summarise the ranges by a user-defined metric and export the results as a data.frame suitable for plotting.

##   GR_overlap_names          combined_ranges  Sample    Signal Antibody
## 1    ProximalPeaks   chr1_21042561_21046561 ZBTB1_1 191.10911    ZBTB1
## 2    ProximalPeaks chr1_181127059_181131059 ZBTB1_1 311.86117    ZBTB1
## 3    ProximalPeaks     chr1_9469096_9473096 ZBTB1_1  95.21104    ZBTB1
## 4    ProximalPeaks   chr1_44494345_44498345 ZBTB1_1 136.83766    ZBTB1

Plotting

Now we have our long format data.frame ready we can pass this to ggplot to visualise as violin plots.
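A minimal ggplot2 sketch of the violin plot. The data.frame here is simulated to mimic the Sample and Signal columns of the long-format table, and the log2 transformation is an assumption made for readability.

```r
library(ggplot2)

# Simulated long-format data with the same Sample and Signal columns (toy values).
set.seed(1)
long_df <- data.frame(
  Sample = rep(c("ZBTB1_1", "ATF4_1"), each = 50),
  Signal = c(rexp(50, rate = 1 / 150), rexp(50, rate = 1 / 60))
)

# One violin per sample, signal on a log2 scale.
p <- ggplot(long_df, aes(x = Sample, y = log2(Signal + 1), fill = Sample)) +
  geom_violin() +
  theme_bw() +
  ylab("log2(Signal + 1)")
p
```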

Annotating ranges

So far we have been working with genomic locations without any annotation to genes they may regulate.

We can easily annotate our ranges in R with the GREAT software, using the annotateRanges_great function in profileplyr.

Annotating to genes

All the annotation is now contained in the row information for the profileplyr object and can be retrieved using the rowRanges() function.

## GRanges object with 1 range and 13 metadata columns:
##           seqnames          ranges strand |                 name     score
##              <Rle>       <IRanges>  <Rle> |          <character> <numeric>
##   giID484     chr1 1508087-1512087      + | chr1:1508899-1511276         0
##                  sgGroup     giID                names  cluster
##                 <factor> <factor>             <factor> <factor>
##   giID484 ZBTB1Peaks.bed  giID484 chr1:1508899-1511276        1
##           hierarchical_order overlap_matrix GR_overlap_names      name.1
##                    <integer>       <matrix>         <factor> <character>
##   giID484                252              0       no_overlap        <NA>
##             score.1      SYMBOL distanceToTSS
##           <numeric> <character>     <numeric>
##   giID484      <NA>       SSU72           162
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths

Annotating to genes

If we want to, we can write this information to a table to review in other software, or to a BED file to review in IGV.
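A sketch of both exports on a single annotated range, built by hand from the values printed above (the range, SYMBOL and distanceToTSS for giID484). The output paths are temporary placeholders.

```r
library(GenomicRanges)
library(rtracklayer)

# One annotated range, using values from the rowRanges() output above.
annotated <- GRanges("chr1", IRanges(start = 1508087, end = 1512087),
                     strand = "+",
                     SYMBOL = "SSU72", distanceToTSS = 162)

# Tab-separated table for review in other software.
table_file <- tempfile(fileext = ".tsv")
write.table(as.data.frame(annotated), table_file,
            sep = "\t", quote = FALSE, row.names = FALSE)

# BED file for IGV, using the gene symbol as the feature name.
bed_ranges <- granges(annotated)
bed_ranges$name <- annotated$SYMBOL
bed_file <- tempfile(fileext = ".bed")
export.bed(bed_ranges, bed_file)
```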

Annotating to genes

We can use the annotation to label where the peaks closest to our genes of interest are in the heatmap.

Grouping by genes

We can also use the annotation for grouping peaks by their nearest genes. Here we have downloaded a GMT file from GSEA’s MSigDB.

We can use the GSEABase package to read in the data from the GMT file and then get the genes in amino acid metabolism.
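A self-contained sketch of reading a GMT file with GSEABase's getGmt() function. Instead of the MSigDB download, it writes a one-line toy GMT file. The set name, description and gene list are invented for illustration.

```r
library(GSEABase)

# Toy GMT file: set name, description, then gene symbols, tab-separated.
gmt_file <- tempfile(fileext = ".gmt")
writeLines("TOY_AMINO_ACID_SET\ttoy description\tATF4\tASNS\tPSAT1", gmt_file)

# Read the gene sets and pull out the gene symbols for our set of interest.
gene_sets <- getGmt(gmt_file)
amino_acid_genes <- geneIds(gene_sets[["TOY_AMINO_ACID_SET"]])

# A named list of genes, the form used for grouping by gene lists.
gene_list <- list(AminoAcid = amino_acid_genes)
gene_list
```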

Grouping by genes

Now we have a named list of the genes of interest, we can use this to group our heatmap using the groupBy function and produce our final heatmap.

Here the group argument accepts a list of gene names instead of a GRanges.

Grouping by genes

The overlap is quite small, so we regroup by a different column and add the gene set as annotation instead, by specifying “GL_overlap_names” to the extra_annotation_columns parameter.

A more comprehensive workflow

Doug Barrows has compiled a fantastic workflow on his site which goes into more detail.

Please work through this workflow here, and ask any questions as needed.

Contact

For any suggestions, comments, edits or questions (about content or the slides themselves), please reach out via our GitHub and raise an issue.